
Fix tokenizer mismatch between benchmark client and sglang server #937

Merged
cquil11 merged 3 commits into SemiAnalysisAI:main from ZhaiFeiyue:fix_tokenizer
Mar 27, 2026

Conversation

@ZhaiFeiyue
Contributor

Summary

Problem

When benchmarking sglang server performance with the latest sglang image (which uses transformers 5.3.0), we observed a significant performance regression compared to the older 0313-2 image (transformers 4.57.1):

  • TTFT: 1394ms vs 482ms (~3x slower)
  • Output throughput: 160 tok/s vs 212 tok/s (~24% drop)

Root Cause

Transformers v5 changed how LlamaTokenizerFast is loaded. During __init__, v5 rebuilds the pre_tokenizer and decoder from scratch using Llama-specific components, discarding the originals defined in tokenizer.json. For models like DeepSeek-R1 that declare LlamaTokenizerFast but actually use a ByteLevel/Sequence tokenizer architecture, v5 incorrectly replaces:

  • pre_tokenizer: Sequence → Metaspace
  • decoder: ByteLevel → Sequence
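As a sanity check, the component types a model actually declares can be read straight from tokenizer.json and compared with what transformers loaded. A minimal sketch (the helper name is ours, not from the PR):

```python
import json

def declared_components(tokenizer_json_path):
    """Return the (pre_tokenizer, decoder) type names that a
    tokenizer.json file declares, e.g. ("Sequence", "ByteLevel")."""
    with open(tokenizer_json_path) as f:
        spec = json.load(f)
    pre = (spec.get("pre_tokenizer") or {}).get("type")
    dec = (spec.get("decoder") or {}).get("type")
    return pre, dec
```

For DeepSeek-R1 this returns types that differ from what a v5-loaded LlamaTokenizerFast actually carries, which is exactly the mismatch described above.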

The sglang server already fixes this at startup via _fix_v5_tokenizer_components() in hf_transformers_utils.py, but the benchmark client loads the tokenizer directly via AutoTokenizer.from_pretrained() without these fixes. This mismatch causes the client and server to tokenize text differently:

| | Client (unfixed) | Server (fixed) |
|---|---|---|
| pre_tokenizer | Metaspace | Sequence |
| 100 random tokens → text → re-encode | 102 tokens | 531 tokens |
| 7140 token prompt | 7140 tokens | 35921 tokens |

The ~5x token inflation on the server side results in 5x more prefill work, which directly causes the observed TTFT and throughput regression.
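The token-inflation numbers above come from a simple round-trip probe: decode some token ids with a tokenizer, then count how many tokens the resulting text re-encodes to. A hedged sketch of that probe (the function name is illustrative, not from the PR):

```python
def roundtrip_token_count(tokenizer, token_ids):
    """Decode ids to text and re-encode with the same tokenizer.
    When pre_tokenizer/decoder match the model's tokenizer.json, the
    count stays close to len(token_ids); a large jump signals the
    client/server mismatch described above."""
    text = tokenizer.decode(token_ids)
    return len(tokenizer.encode(text))
```

Per the table above, a corrupted tokenizer turns 100 random tokens into 531 on re-encode, while a correct one stays near the original count.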

Fix

Add _fix_tokenizer_for_sglang() to the benchmark client's get_tokenizer(), applying the same two fixes that sglang server applies:

  1. pre_tokenizer/decoder restoration: Read the original components from tokenizer.json and overwrite the incorrect v5 replacements.
  2. add_bos_token recovery: Restore add_bos_token/add_eos_token flags from tokenizer_config.json that v5 strips during loading.

This is a no-op on transformers v4.
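A sketch of what such a client-side fix can look like, assuming the file layout described above (this is our reconstruction of the approach, not the PR's exact code):

```python
import json
from pathlib import Path

def fix_tokenizer_for_sglang(tokenizer, model_path):
    """Re-apply the components that transformers v5 discards.
    No-op when the files are missing or the types already match (v4)."""
    tok_file = Path(model_path) / "tokenizer.json"
    if tok_file.is_file():
        try:
            from tokenizers import Tokenizer as RawTokenizer
            raw = RawTokenizer.from_file(str(tok_file))
            backend = tokenizer.backend_tokenizer  # Rust backend of a fast tokenizer
            raw_pre = type(raw.pre_tokenizer).__name__ if raw.pre_tokenizer else None
            cur_pre = type(backend.pre_tokenizer).__name__ if backend.pre_tokenizer else None
            if raw_pre and cur_pre and raw_pre != cur_pre:
                backend.pre_tokenizer = raw.pre_tokenizer
                if raw.decoder is not None:  # don't null out a working decoder
                    backend.decoder = raw.decoder
        except Exception:
            pass  # leave the tokenizer as loaded rather than fail the benchmark
    cfg_file = Path(model_path) / "tokenizer_config.json"
    if cfg_file.is_file():
        try:
            cfg = json.loads(cfg_file.read_text())
            for flag in ("add_bos_token", "add_eos_token"):
                if flag in cfg:
                    setattr(tokenizer, flag, cfg[flag])
        except Exception:
            pass
    return tokenizer
```

Both branches are guarded so that a transformers v4 environment, or a model whose components already match, passes through untouched.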

Results

Token count (per request, ~7400 token input):

| | Before fix | After fix |
|---|---|---|
| Client-side token count | 7140 | 7140 |
| Server-side prompt_tokens | 35921 (5x inflated) | 7141 (correct) |
| Prefill chunks | 3 (16384+16384+3153) | 1 (7141) |

Benchmark (10 prompts, concurrency=1, input ~8K, output ~1K, --disable-radix-cache):

| Metric | Before fix | After fix | Old image (baseline) |
|---|---|---|---|
| Output throughput (tok/s) | 160.49 | 211.71 | 212.00 |
| Mean TTFT (ms) | 1394.84 | 215.00 | 215.26 |
| Mean TPOT (ms) | 4.74 | 4.49 | 4.26 |
| Mean E2EL (ms) | 5796.15 | 4393.74 | 4387.54 |

After fix, new image performance matches old image — no actual regression.

…transformers v5

Transformers v5 incorrectly rebuilds pre_tokenizer/decoder components for
models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama
tokenizer architecture. The sglang server fixes this at startup, but the
benchmark client loads the tokenizer without these fixes, causing a ~5x
token count inflation (e.g. 7000 tokens -> 35000 tokens) and false
performance regressions in TTFT and throughput benchmarks.

Apply the same tokenizer fixes (pre_tokenizer/decoder restoration and
add_bos_token recovery) that sglang server applies, so client and server
tokenize identically. No-op on transformers v4.

Made-with: Cursor
@functionstackx
Contributor

thanks @ZhaiFeiyue, is this for all models in sglang or just for glm5?

@cquil11
Collaborator

cquil11 commented Mar 24, 2026

@ZhaiFeiyue Thank you for identifying this error and for the fix. +1 to functionstackx comment. It will be helpful to identify which configs from amd-master.yaml and nvidia-master.yaml are affected by this so that we can re-run and get accurate perf across the board.

+viz @kedarpotdar-nv

@ZhaiFeiyue
Contributor Author

thanks @ZhaiFeiyue, is this for all models in sglang or just for glm5?

hi @functionstackx, we only found this issue with DeepSeek-R1; we have not run any tests for glm5

kkHuang-amd pushed a commit to kkHuang-amd/useful-scripts that referenced this pull request Mar 26, 2026
@billishyahao
Collaborator

@ZhaiFeiyue Thank you for identifying this error and for the fix. +1 to functionstackx comment. It will be helpful to identify which configs from amd-master.yaml and nvidia-master.yaml are affected by this so that we can re-run and get accurate perf across the board.

+viz @kedarpotdar-nv

Hi @cquil11, no configs are currently affected by the patch. We will only observe the issue after we upgrade sglang to the latest main, so this is proactive protection.

@cquil11
Collaborator

cquil11 commented Mar 27, 2026

@claude Review



@cquil11 cquil11 left a comment


@cquil11 cquil11 merged commit ef01c22 into SemiAnalysisAI:main Mar 27, 2026
Comment on lines +471 to +472
```python
from tokenizers import Tokenizer as RawTokenizer
tok_file = Path(model_path) / "tokenizer.json"
```

🔴 The tokenizer fix is silently skipped when a HuggingFace Hub model ID (e.g. 'deepseek-ai/DeepSeek-R1') is passed instead of a local directory path. Path('deepseek-ai/DeepSeek-R1') / 'tokenizer.json' constructs a relative path that doesn't exist on disk, so both tok_file.is_file() and config_file.is_file() return False and the entire fix is bypassed. Since get_model() returns the model ID unchanged when ModelScope is not in use, this covers the common case for most benchmark users — the tokenizer mismatch this PR is intended to fix will not be corrected for them.

Extended reasoning...

What the bug is and how it manifests

The _fix_tokenizer_for_sglang function reads tokenizer configuration files from the local filesystem using Path(model_path) / "tokenizer.json" and Path(model_path) / "tokenizer_config.json". Both reads are guarded by .is_file() checks. If model_path is a HuggingFace Hub model ID like 'deepseek-ai/DeepSeek-R1', neither path exists on disk, so both checks return False and both fixes — the pre_tokenizer/decoder restoration and the add_bos/add_eos recovery — are silently skipped. The function returns the tokenizer unchanged.

The specific code path that triggers it

In get_tokenizer() (line ~517), when pretrained_model_name_or_path is not a local directory, get_model() is called. Without VLLM_USE_MODELSCOPE=true, get_model() simply returns the input string unchanged (lines 439-441). AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-R1') then succeeds by downloading to the HuggingFace cache. Finally, _fix_tokenizer_for_sglang(tokenizer, 'deepseek-ai/DeepSeek-R1') is called with the Hub ID as model_path. Inside, Path('deepseek-ai/DeepSeek-R1') / 'tokenizer.json' resolves to the relative path deepseek-ai/DeepSeek-R1/tokenizer.json in the current working directory — which does not exist.

Why existing code doesn't prevent it

The code has no fallback to resolve the HF cache location. The is_file() guards are appropriate for local paths but become a silent no-op for Hub IDs. There is no warning, log message, or exception — the function just returns the unmodified tokenizer. Because AutoTokenizer.from_pretrained succeeds silently using the HF cache, callers have no indication the fix was skipped.

What the impact would be

Any user who passes a HuggingFace Hub model ID to the benchmark script (the default and most common usage pattern) will get an unfixed tokenizer. The benchmark client will still tokenize text differently from the sglang server — the exact mismatch this PR intends to resolve. For DeepSeek-R1, this means the client sees ~7,000 tokens but the server sees ~35,000 tokens, causing ~5x TTFT inflation and ~24% throughput regression as described in the PR description. The fix only works if users explicitly pre-download models to a local path.

How to fix it

After loading the tokenizer, resolve the local cache path before calling _fix_tokenizer_for_sglang. The simplest approach is to use tokenizer.name_or_path (which AutoTokenizer populates with the resolved cache directory) or call huggingface_hub.snapshot_download(model_id, local_files_only=True) to get the cache path without triggering a re-download. For example: _fix_tokenizer_for_sglang(tokenizer, tokenizer.name_or_path).
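The suggested remedy can be sketched as a small helper that maps a Hub ID to its local cache directory before the fix runs (the helper name is illustrative; `snapshot_download` with `local_files_only=True` is a real huggingface_hub API):

```python
import os

def resolve_local_model_path(model_id_or_path):
    """Return a directory that should contain tokenizer.json: the input
    itself if it is a local dir, else the HF cache snapshot for the ID."""
    if os.path.isdir(model_id_or_path):
        return model_id_or_path
    try:
        from huggingface_hub import snapshot_download
        # local_files_only=True resolves the cached snapshot without
        # touching the network; raises if the model is not cached yet.
        return snapshot_download(model_id_or_path, local_files_only=True)
    except Exception:
        return model_id_or_path  # fall back; the fix then stays a no-op
```

Calling `_fix_tokenizer_for_sglang(tokenizer, resolve_local_model_path(model_id))` would then find tokenizer.json for both local paths and Hub IDs.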

Step-by-step proof

  1. User runs benchmark with --model deepseek-ai/DeepSeek-R1 (no local path).
  2. get_tokenizer('deepseek-ai/DeepSeek-R1') is called.
  3. os.path.exists('deepseek-ai/DeepSeek-R1') → False.
  4. get_model('deepseek-ai/DeepSeek-R1') → returns 'deepseek-ai/DeepSeek-R1' (ModelScope not set).
  5. AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-R1') → succeeds, tokenizer loaded from ~/.cache/huggingface/hub/....
  6. _fix_tokenizer_for_sglang(tokenizer, 'deepseek-ai/DeepSeek-R1') called.
  7. Path('deepseek-ai/DeepSeek-R1') / 'tokenizer.json' → PosixPath('deepseek-ai/DeepSeek-R1/tokenizer.json').
  8. tok_file.is_file() → False → pre_tokenizer/decoder fix skipped.
  9. config_file.is_file() → False → add_bos/add_eos fix skipped.
  10. Tokenizer returned with v5 corrupted pre_tokenizer (Metaspace instead of Sequence) and decoder (Sequence instead of ByteLevel).
  11. Client tokenizes 7,000-token prompt → server receives and re-tokenizes to ~35,000 tokens → 5x TTFT inflation.

Comment on lines +476 to +478
```python
loaded_pre = type(backend.pre_tokenizer).__name__ if backend.pre_tokenizer else None
if raw_pre and loaded_pre and raw_pre != loaded_pre:
    backend.pre_tokenizer = raw.pre_tokenizer
```

🟡 In _fix_tokenizer_for_sglang, backend.decoder = raw.decoder is assigned unconditionally without checking if raw.decoder is None. If a tokenizer.json defines a pre_tokenizer but no decoder, this silently nulls out the existing working decoder on an already-loaded tokenizer. A one-line guard (if raw.decoder is not None) would make this assignment safe.

Extended reasoning...

What the bug is: At lines 477-478, when a pre_tokenizer type mismatch is detected, the code unconditionally assigns:

```python
backend.pre_tokenizer = raw.pre_tokenizer
backend.decoder = raw.decoder
```

The entry condition (if raw_pre and loaded_pre and raw_pre != loaded_pre) only validates that both tokenizers have a pre_tokenizer component — it does not check whether raw.decoder is non-None before performing the decoder assignment.

The specific code path: If tokenizer.json defines a pre_tokenizer but omits a decoder key (or sets it to null), raw.decoder will be None. The code will then execute backend.decoder = None, silently overwriting a previously valid decoder on the already-loaded tokenizer.

Why existing guards don't prevent it: The condition guards raw.pre_tokenizer (via raw_pre) and backend.pre_tokenizer (via loaded_pre) but makes no assertion about raw.decoder. The guard structure was designed to check pre_tokenizer consistency, not decoder availability. The surrounding try/except Exception: pass also means any later decoding failure — which would be a downstream consequence, not a local exception — would be completely silent and hard to trace.

Impact: Any tokenizer.json that specifies a pre_tokenizer without a decoder would silently break all token-to-text decoding after this function runs. Since the failure happens after the function returns, it would manifest as garbage output or crashes far from the assignment site, making the root cause very difficult to diagnose.

Step-by-step proof:

  1. A model's tokenizer.json contains: {"pre_tokenizer": {"type": "Sequence", ...}} but no "decoder" key.
  2. raw = RawTokenizer.from_file(str(tok_file)) loads this file; raw.decoder is None.
  3. raw_pre = "Sequence", loaded_pre = "Metaspace" (v5 replacement) → condition is True.
  4. backend.pre_tokenizer = raw.pre_tokenizer — correct.
  5. backend.decoder = raw.decoder → backend.decoder = None — decoder destroyed silently.
  6. Any subsequent tokenizer.decode(token_ids) call now fails or produces empty output.

Addressing the refutation: The refutation correctly notes that real-world tokenizer.json files with a pre_tokenizer almost universally also define a decoder (DeepSeek-R1 does have a ByteLevel decoder). In practice this scenario is unlikely to trigger for the intended targets. However, the defensive None-check is a one-line fix with zero downside, and code that silently corrupts state depending on an implicit invariant of external files is fragile. The missing check is a real defect, even if it is low severity.

Fix:

```python
if raw_pre and loaded_pre and raw_pre != loaded_pre:
    backend.pre_tokenizer = raw.pre_tokenizer
    if raw.decoder is not None:
        backend.decoder = raw.decoder
```

